R Workflow
Download .Rmd (won’t work in Safari or IE)
See GitHub Repository
Data are the core of everything that we do in statistical analysis.
Data come in many forms, and I don’t just mean .csv,
.xls, .sav, etc. Data can be wide, long,
documented, fragmented, messy, and about anything else that you can
imagine.
Although data are arguably more means than end in psychology, the importance of understanding the structure and format of your data cannot be overstated. Failing to understand your data can lead to improper techniques and, at worst, flagrantly wrong inferences.
In this tutorial, we are going to talk about data management and basic data cleaning. Earlier tutorials went more in depth into data cleaning and reshaping. This tutorial is meant to take what you learned there, help you think about it in more nuanced ways, and develop a functional workflow for conducting your own research.
A good workflow starts by keeping your files organized outside of
R. A typical research project involves:
You can set this up outside of R, but I’m going to
quickly show you how to set it up from inside R.
```r
# create the root project folder, then its subfolders
data_path <- "~/Desktop/06_workflow"
dir.create(data_path)  # make sure the root folder exists first

c("data", "scripts", "results", "manuscript", "experimental materials",
  "preregistration", "papers") %>%
  paste(data_path, ., sep = "/") %>%
  map(dir.create)
```

When I create an R Markdown document for my own research projects, I always start by setting up my workspace. This involves 3 steps:
Below, we will step through each of these separately, setting
ourselves up to (hopefully) flawlessly communicate with R
and our data.
Loading packages seems like the most basic step, but it is actually very important. ALWAYS LOAD YOUR PACKAGES IN A VERY INTENTIONAL ORDER AT THE BEGINNING OF YOUR SCRIPT. Package conflicts are painful, which is why this needs to be shouted.
For this tutorial, we are going to keep things quite simple. We will load the
psych package for data descriptives, some options for
cleaning and reverse coding, and some evaluations of our scales. The
plyr package is the predecessor of the dplyr
package, which is a core package of the tidyverse, which
you will become quite familiar with in these tutorials. I like the
plyr package because it contains a couple of functions
(e.g. mapvalues()) that I find quite useful. Finally, we
load the tidyverse package, which is actually a
compilation of 8 packages. Some of these we will use today and some we
will use in later tutorials. All are very useful and are arguably some
of the most powerful tools R offers.
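To see why order matters, consider that plyr and dplyr (attached as part of the tidyverse) both export functions such as summarise() and mutate(); whichever package is attached last masks the other. A minimal sketch, assuming both packages are installed:

```r
library(plyr)   # attached first
library(dplyr)  # attached second, so its functions win the conflict

# the default summarise() now comes from dplyr, not plyr
environmentName(environment(summarise))  # "dplyr"

# a masked function is still reachable via an explicit namespace call,
# e.g. plyr::summarise() or dplyr::summarise()
```

This is why I load plyr before the tidyverse below: the dplyr versions end up on top, and I call the handful of plyr functions I need with the `plyr::` prefix.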
```r
# load packages (order matters: plyr before the tidyverse)
library(psych)
library(plyr)
library(tidyverse)
```

The second step is a codebook. Arguably, this is the first step because you should create the codebook long before you open R and load your data.
In this case, we are going to use some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing panel study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.
For this tutorial, I created the codebook for you (Download (won’t work in Safari or IE)), and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.
Here are my core columns that are based on the original data:
1. dataset: this column indexes the
name of the dataset that you will be pulling the data
from. This is important because we will use this info later on (see
purrr tutorial) to load and clean specific data files. Even if you don’t
have multiple data sets, I believe consistency is important, so I
suggest including this column anyway.
2. old_name: this column is the name of the variable in
the data you are pulling it from. This should be exact. The goal of this
column is that it will allow us to select() variables from the original
data file and rename them something that is more useful to us.
3. item_text: this column is the original text that
participants saw or a description of the item.
4. category: broad categories that different variables
can be put into. I’m a fan of naming them things like “outcome”,
“predictor”, “moderator”, “demographic”, “procedural”, etc. but
sometimes use more descriptive labels like “Big 5” to indicate the model
from which the measures are derived.
5. label: label is basically one level lower than
category. So if the category is Big 5, the label would be, for example,
“A” for Agreeableness, “SWB” for subjective well-being, etc. This column
is most important and useful when you have multiple items in a scale,
so I’ll typically leave this blank when something is a standalone
variable (e.g. sex, single-item scales, etc.).
6. item_name: This is the lowest level and most
descriptive variable. It indicates which item in scale something is. So
it may be “kind” for Agreeableness or “sex” for the demographic
biological sex variable.
7. new_name: This is a column that brings together much
of the information we’ve already collected. Its purpose is to be the
new name that we will give to the variable that is more useful and
descriptive to us. This is a constructed variable that brings together
others. I like to make it a combination of “category”, “label”,
“item_name”, and year using varying combos of “_” and “.” that we can
use later with tidyverse functions. I typically construct this variable
in Excel using the CONCATENATE() function, but it could also be done in
R. The reason I do it in Excel is that it makes it easier
for someone who may be reviewing my codebook.
8. scale: this column tells you what the scale of the
variable is. Is it a numeric variable, a text variable, etc.? This is
helpful for knowing the plausible range.
9. recode: sometimes, we want to recode variables for
analyses (e.g. for categorical variables with many levels where sample
sizes for some levels are too small to actually do anything with them).
I use this column to note the kind of recoding I’ll do to a variable
for transparency.
Here are additional columns that will make our lives easier or are
applicable to some but not all data sets:
10. reverse: this column tells you whether items in a
scale need to be reverse coded. I recommend coding this as 1 (leave
alone) and -1 (reverse) for reasons that will become clear later.
11. mini: this column represents the minimum value of
scales that are numeric. Leave blank otherwise.
12. maxi: this column represents the maximum value of
scales that are numeric. Leave blank otherwise.
13. year: for longitudinal data, we have several waves
of data and the name of the same item across waves is often different,
so it’s important to note to which wave an item belongs. You can do this
by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data
were collected (e.g. 2005, 2009, etc.).
14. meta: Some datasets have a meta name, which
essentially means a name that variable has across all waves to make it
clear which variables are the same. They are not always useful as some
data sets have meta names but no great way of extracting variables using
them. But they’re still typically useful to include in your codebook
regardless.
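To make these columns concrete, here is a hypothetical two-row slice of a codebook (the dataset, item names, and values are all invented for illustration). It also shows how new_name could be constructed in R with sprintf() instead of Excel’s CONCATENATE(), and why coding reverse as 1/-1 is handy: recoded = value * reverse + (mini + maxi) * (reverse == -1) leaves normal items alone and flips reversed ones.

```r
library(tidyverse)

# hypothetical codebook rows; all names and values are made up
codebook_demo <- tribble(
  ~dataset, ~old_name, ~category, ~label, ~item_name, ~year, ~reverse, ~mini, ~maxi,
  "vp",     "vp12501", "big5",    "A",    "kind",      2005,        1,     1,     7,
  "vp",     "vp12502", "big5",    "A",    "rude",      2005,       -1,     1,     7
) %>%
  # build new_name in R rather than in Excel with CONCATENATE()
  mutate(new_name = sprintf("%s_%s.%s_%s", category, label, item_name, year))

codebook_demo$new_name
# "big5_A.kind_2005" "big5_A.rude_2005"

# the 1/-1 reverse column makes reverse coding a one-liner:
tibble(value = c(6, 6), reverse = c(1, -1), mini = 1, maxi = 7) %>%
  mutate(recoded = value * reverse + (mini + maxi) * (reverse == -1))
# reverse ==  1: 6 stays 6
# reverse == -1: 6 becomes -6 + (1 + 7) = 2
```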
Below, I’ll load in the codebook we will use for this study, which includes all of the above columns.
```r
# set the path
wd <- "https://github.com/emoriebeck/R-tutorials/blob/master/06_workflow"

# download the codebook into the local project folder
download.file(
  url = sprintf("%s/data/codebook.csv?raw=true", wd),
  destfile = sprintf("%s/data/codebook.csv", data_path)
)

# load the codebook
(codebook <- sprintf("%s/data/codebook.csv", data_path) %>%
  read_csv(.) %>%
  mutate(old_name = str_to_lower(old_name)))
```

First, we need to load in the data. We’re going to use data from the German Socioeconomic Panel Study, a longitudinal study of German households that has been conducted since 1984. Specifically, we’re going to use three waves of personality data collected between 2005 and 2013.
Note: we will be using the teaching set of the GSOEP data
set, so I will not be pulling from the raw files. I will also not
mirror the format in which you would usually load the GSOEP, because
that is slightly more complicated and something we will return to in a
later tutorial on
purrr
(link) after we have more skills. I’ve left that code here for now,
but it won’t make a lot of sense yet.
```r
# local path to the raw GSOEP .sav files (not distributed with this tutorial)
path <- "~/Box/network/other projects/PCLE Replication/data/sav_files"

# reference file: household numbers and sample-membership codes
ref <- sprintf("%s/cirdef.sav", path) %>% haven::read_sav(.) %>% select(hhnr, rgroup20)

# read one wave: keep the codebook variables for that year, join the
# reference info, and reshape to long format
read_fun <- function(Year){
  vars <- (codebook %>% filter(year == Year | year == 0))$old_name
  set  <- (codebook %>% filter(year == Year))$dataset[1]
  sprintf("%s/%s.sav", path, set) %>% haven::read_sav(.) %>%
    full_join(ref) %>%
    filter(rgroup20 > 10) %>%
    select(one_of(vars)) %>%
    gather(key = item, value = value, -persnr, -hhnr, na.rm = T)
}

# demographics live in their own file (coded year == 0 in the codebook)
vars <- (codebook %>% filter(year == 0))$old_name
dem  <- sprintf("%s/ppfad.sav", path) %>%
  haven::read_sav(.) %>%
  select(one_of(vars))

# read all waves, reshape to wide, add demographics, and save
tibble(year = c(2005:2015)) %>%
  mutate(data = map(year, read_fun)) %>%
  select(-year) %>%
  unnest(data) %>%
  distinct() %>%
  filter(!is.na(value)) %>%
  spread(key = item, value = value) %>%
  left_join(dem) %>%
  write.csv(., file = "~/Documents/Github/R-tutorials/ALDA/week_1_descriptives/data/week_1_data.csv", row.names = F)
```

The code below shows how I would read in and rename a wide-format data set using the codebook I created.
```r
# download the teaching data file
download.file(
  url = sprintf("%s/data/workflow_data.csv?raw=true", wd),
  destfile = sprintf("%s/data/workflow_data.csv", data_path)
)

old.names <- codebook$old_name # get old column names
new.names <- codebook$new_name # get new column names

(soep <- sprintf("%s/data/workflow_data.csv", data_path) %>% # path to data
  read_csv(.) %>%                 # read in data
  select(one_of(old.names)) %>%   # select the columns listed in our codebook
  setNames(new.names))            # rename columns with our new names
```
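One payoff of a consistent new_name convention: because each name encodes category, label, item, and year with predictable separators, you can split the names back into their components with tidyverse verbs. A sketch using the hypothetical pattern category_label.item_year (your own separators may differ):

```r
library(tidyverse)

# hypothetical data whose names follow category_label.item_year
dat <- tibble(
  new_name = c("big5_A.kind_2005", "big5_A.rude_2005"),
  value    = c(5, 2)
) %>%
  # first split on "_", then split the middle piece on "."
  separate(new_name, into = c("category", "rest", "year"), sep = "_") %>%
  separate(rest, into = c("label", "item_name"), sep = "\\.")

# dat now has columns: category, label, item_name, year, value
```

This kind of name-driven splitting is exactly why I construct new_name so carefully in the codebook: the structure you encode going in is the structure you get back out.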